Abstract

We introduce a fast, computationally low-cost, SVM-based feature elimination
method, and show that it is effective on high-dimensional data with, e.g., thousands of
features. The method's premise is to use light classifier retraining, rather than full
SVM retraining, at each feature elimination step, while remaining consistent with both
the margin maximization central to SVM learning and a well-known upper bound in
Statistical Learning Theory on the expected risk of making a classification error. On
high-d gene and other datasets, we show that our proposed method achieves higher or
competitive generalization accuracy (lower or similar test set classification error rate)
than earlier MFE-based methods and RFE, even when the earlier methods use full
SVM retraining at each step, i.e. re-learn SVM weights from scratch. We find that our
proposed method and the previously proposed MFE-LO are the two best performers
with respect to generalization accuracy, and point out some limitations of MFE-LO.
Our proposed method's accuracy may increase by simply performing hyperparameter
selection during the elimination process rather than solely during pre-elimination. We
also point out that the results herein for some previously published MFE-based methods,
including MFE-LO, come from a more proper experimental evaluation of those particular
methods. Lastly, we note that we intend to compare our method's performance in the
future with additional methods, such as sparse estimation methods, e.g., for the case of
linear classification, the state-of-the-art Lasso and $\ell_1$ logistic regression. For
the case of nonlinear kernels, however, our proposed method may have an advantage,
e.g. for some data domains, especially if the hyperparameter tuning benefits are sought;
for the nonlinear case, we present extensive results obtained with our proposed method.
1 Introduction, Background, Related Work

Subset selection draws much interest; given an initially M-dimensional - often high-dimensional - 
dataset, the aim is to select a subset of this initial feature set, for purposes such as: 1) identifying 
a small subset of features - i.e. "markers" - necessary for making good predictions, especially 
important when M is huge and the number of samples is small; e.g. "biomarkers" for biomedical 
imaging studies e.g. [H El [71 ttSJ |24] and gene studies in bioinformatics e.g. [HI HE]; 2) combatting 
the curse of dimensionality (COD) i.e. the potential degradation of generalization accuracy when 
the initial feature dimensionality M is increased beyond a certain point [21]; 3) reducing the 
complexity of the classification operation - both memory storage and computation - since accuracy 
gains attainable despite COD by a large feature set may be small. 



Since there are $2^M - 1$ possible subsets, exhaustive search is practically prohibitive for even
modest M e.g. less than thousands. Subset selection methods, which are thus heuristic, include 
"front-end" (aka "filtering") methods, "wrapper" methods, and "embedded" methods [IQiilSj. To 
briefly discuss these three categories: 1) Front-end methods: Some of these methods evaluate the
discrimination power of small feature groups prior to classifier training, then combine small groups
to form a final (retained) feature subset. Although this is robust to overfitting, it tends not to achieve
sufficiently high generalization accuracy, because features that can significantly improve
generalization accuracy when taken with other features are ignored [10]^1, [13]. Past work on front-end
methods includes e.g. [MKIZ]. 2) Wrapper methods include forward selection, backward elimina- 
tion, and bidirectional methods; these methods repeatedly evaluate the classification accuracy of 
candidate feature subsets via i) classifier training (e.g. recursive feature elimination (RFE) method 
used in e.g. [IIIIIS], MFE methods such as MFE-LO and MFE-Slack [1]) or ii) without training
the classifier in the reduced space (e.g. the 'basic MFE' method [1]); herein we will compare our
results with both of these subcategories. Other past work on wrapper methods includes e.g. [25]. 
3) Embedded methods aim to suppress irrelevant features by formulating the classifier training's
optimization problem in a way that e.g. drives to zero the weights of many features, whereby the
rest of the features form the retained set. To that end, different norms (e.g. $\ell_1$, $\ell_2$) and
optimization approaches have been investigated, e.g. [20], [26], [9], [H], [4]. A premise of such embedded
methods is the "bet on sparsity" principle for high-d data [20], which is essentially as follows: if the
number of features far exceeds the number of samples and the true (and unknown) "weights" for
the features are Gaussian, neither $\ell_1$ nor $\ell_2$ will estimate the weights well, because the data is
too little for estimating these nonzero weights due to COD, but solution sparsity can nevertheless
be encouraged via use of $\ell_1$ (whereby numerous features will be driven to zero)^2.



1.1 Brief review of SVMs 

Consider a labeled training set $\{(\mathbf{x}_n, y_n)\}$, $n \in \mathcal{N} = \{1, \dots, N\}$, where the n-th sample
$\mathbf{x}_n = [x_{n,1} \cdots x_{n,M}]^T \in \mathbb{R}^M$ has class label $y_n \in \{\pm 1\}$. For subscripted vectors, e.g. $\mathbf{x}_n$, the i-th
coordinate value $x_{n,i}$ is also denoted $x_{ni}$. A hyperplane acting as a binary (two-class) decision
function is defined by $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$, $\mathbf{w} \in \mathbb{R}^M$, $w_0 \in \mathbb{R}$. Denoting $g_n = g(\mathbf{x}_n) = y_n f(\mathbf{x}_n)$
and $\mathbf{g} = [g_1 \dots g_N]^T$, the signed distance from a data point $\mathbf{x}_n$ to the decision boundary is $g_n / \|\mathbf{w}\|$.
The decision boundary is a separating one if $g_n > 0\ \forall n$, and its margin is accordingly defined as
$\gamma = \min_n g_n / \|\mathbf{w}\|$. An SVM is a linear or generalized linear two-class classifier that learns a separator for
the training set with maximum margin. The "support vectors", denoted by the set $S = \{\mathbf{s}_1, \dots, \mathbf{s}_T\}$
(with index set $\mathcal{S} = \{1, 2, \dots, T\}$), used to specify the SVM solution, are a subset of the training
points at margin distance to the decision boundary. In the linear case, the SVM weight vector is
given by $\mathbf{w} = \sum_{k \in \mathcal{S}} \lambda_{s_k} y_{s_k} \mathbf{s}_k$, where $\lambda_{s_k}$ are scalar Lagrange multipliers. In the generalized linear
(nonlinear) case, denoting $\phi(\mathbf{x}) = [\phi_1(\mathbf{x}), \dots, \phi_L(\mathbf{x})]^T$ where $\phi_i(\mathbf{x})$ are nonlinear functions of the $\mathbf{x}$
coordinates, inner products between $\phi(\mathbf{x})$ and $\phi(\mathbf{u})$ that can be efficiently computed via a positive
definite kernel function $K(\mathbf{x}, \mathbf{u}) = \phi^T(\mathbf{x})\phi(\mathbf{u})$ are of particular interest; in this case both $\phi(\cdot)$ and
$\mathbf{w}$ itself need not be explicitly defined, since both the SVM discriminant function $f(\cdot)$ and the SVM
weight vector squared 2-norm can be expressed solely in terms of the kernel, i.e.:

$$f(\mathbf{x}) = \sum_{k \in \mathcal{S}} \lambda_{s_k} y_{s_k} K(\mathbf{s}_k, \mathbf{x}) + w_0, \tag{1}$$

$$\|\mathbf{w}\|^2 = \sum_{k \in \mathcal{S}} \sum_{l \in \mathcal{S}} \lambda_{s_k} y_{s_k} \lambda_{s_l} y_{s_l} K(\mathbf{s}_k, \mathbf{s}_l). \tag{2}$$

^1 Note that a feature useless by itself for class separation (e.g. producing fully overlapping class-conditional
densities) can improve accuracy when taken with other features, and two useless features can be useful
together [IHIIIS].

^2 Embedded methods estimate how a cost function (e.g. an objective function in an optimization problem
formulation) will change with movements in a feature subspace; often performed in a backward or forward
framework, this approach is able to produce nested subsets of features [10].



This approach (the "kernel trick"), where $K(\cdot,\cdot)$ is explicitly specified and provided to the SVM
training, is known as the "nonlinear kernel case".
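The kernel expansions (1) and (2) can be checked numerically. The following is a minimal sketch, assuming a linear kernel so that the explicit weight vector is available for comparison; the function names and the toy support vectors, multipliers, and labels are hypothetical illustration values, not taken from the experiments herein.

```python
def linear_kernel(u, v):
    """K(u, v) = u . v (the linear kernel, chosen so w can also be formed explicitly)."""
    return sum(ui * vi for ui, vi in zip(u, v))

def decision_value(x, svs, lambdas, labels, w0, kernel):
    """Eq. (1): f(x) = sum_k lambda_k * y_k * K(s_k, x) + w0."""
    return sum(l * y * kernel(s, x) for s, l, y in zip(svs, lambdas, labels)) + w0

def squared_norm(svs, lambdas, labels, kernel):
    """Eq. (2): ||w||^2 as a double sum over the support vectors."""
    return sum(
        lk * yk * ll * yl * kernel(sk, sl)
        for sk, lk, yk in zip(svs, lambdas, labels)
        for sl, ll, yl in zip(svs, lambdas, labels)
    )

# Hypothetical toy solution with two support vectors.
svs = [[1.0, 1.0], [-1.0, -1.0]]
lambdas = [0.25, 0.25]
labels = [1, -1]
w0 = 0.0

# Explicit weight vector w = sum_k lambda_k * y_k * s_k (linear case only).
w = [sum(l * y * s[i] for s, l, y in zip(svs, lambdas, labels)) for i in range(2)]

x = [2.0, 0.5]
f_kernel = decision_value(x, svs, lambdas, labels, w0, linear_kernel)
f_explicit = linear_kernel(w, x) + w0
norm_kernel = squared_norm(svs, lambdas, labels, linear_kernel)
norm_explicit = linear_kernel(w, w)
```

Swapping `linear_kernel` for any other positive definite kernel leaves `decision_value` and `squared_norm` unchanged, which is the point of the kernel trick: only the explicit-`w` comparison stops being available.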

For the linear or nonlinear kernel case, the basic SVM training problem is:

$$\min \tfrac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_n f(\mathbf{x}_n) \ge 1,\ \forall n \in \mathcal{N} \tag{3}$$

Recall that $f(\mathbf{x}_n)$ in (3) simply stands for $\mathbf{w}^T\mathbf{x}_n + w_0$ in the linear case and $\langle \mathbf{w}, \phi(\mathbf{x}_n) \rangle + w_0$ in the
nonlinear kernel case; i.e., note that $\mathbf{w}$ and $w_0$ are a part of the constraints in (3) (even though
above they do not appear so, due to (3) having been conveniently stated generically to cover both
linear and nonlinear cases). The relationship of (3) to margin maximization can be understood as
follows [1]. Assuming we have a separator (i.e. the training data is separable), with margin $\gamma$ equal
to $\min_n g(\mathbf{x}_n) / \|\mathbf{w}\|$, note that $g(\cdot) = y f(\cdot)$ can be amplitude-scaled by an arbitrary positive constant $\rho$
without altering the decision boundary. In particular, by forming $\tilde{g} = \rho g$, where $\rho = 1 / \min_n g(\mathbf{x}_n)$ so
as to make $\min_n \tilde{g}(\mathbf{x}_n)$ equal to 1 (for consistency with the constraints), the margin $\gamma$ is given by
$\min_n \tilde{g}(\mathbf{x}_n) / (\rho\|\mathbf{w}\|) = 1 / (\rho\|\mathbf{w}\|)$. We thus see the well-known result that, for this special choice of $\rho$,
maximizing margin is equivalent to minimizing the squared weight vector 2-norm. The SVM training problem
can alternatively be posed as

$$\min \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_n \xi_n \quad \text{s.t.} \quad \xi_n \ge 0,\ y_n f(\mathbf{x}_n) \ge 1 - \xi_n,\ \forall n \in \mathcal{N} \tag{4}$$

so as to allow slackness ($\boldsymbol{\xi} = [\xi_1, \dots, \xi_N]^T$) in the margin constraints. (4) allows some support
vectors to be practically closer than others to the hyperplane (by nonnegative slackness amounts
$\xi_n$), thus handling both margin violations (i.e., $\xi_n > 0$) and nonseparable data (a classification
error occurs for sample n if $\xi_n > 1$)^3. For choosing the SVM training parameter C (and other SVM
hyperparameters in the nonlinear kernel case), the standard practice of using a validation or
cross-validation procedure [3], [12] can be employed. The relationship of (4) to margin maximization can
be understood as follows [1]. Notice in (4) that if C is made sufficiently large, no margin slackness
will be tolerated and minimizing (4) reduces to minimizing the squared weight vector 2-norm and,
thus, to maximizing margin. We thus see that (4) is a generalization of strict margin maximization
(3) that specializes to strict margin maximization when C is made sufficiently large.
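The rescaling argument above can be made concrete with a few lines of arithmetic. This is a minimal sketch for the linear case; the function name `margin_and_rho` and the toy data are hypothetical, chosen only so that the numbers are easy to follow.

```python
import math

def margin_and_rho(X, y, w, w0):
    """Return (gamma, rho, ||w||) for a separating hyperplane (w, w0).

    gamma = min_n g_n / ||w|| is the margin, and rho = 1 / min_n g_n is the
    scaling that makes the smallest rescaled g_n equal to 1.
    """
    g = [yn * (sum(wi * xi for wi, xi in zip(w, xn)) + w0) for xn, yn in zip(X, y)]
    g_min = min(g)
    if g_min <= 0:
        raise ValueError("(w, w0) does not separate the training data")
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return g_min / norm_w, 1.0 / g_min, norm_w

# Hypothetical separable toy data and an unscaled separator.
X = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-4.0, -1.0]]
y = [1, 1, -1, -1]
w, w0 = [3.0, 0.0], 0.0

gamma, rho, norm_w = margin_and_rho(X, y, w, w0)
# After rescaling by rho, the margin equals 1 / (rho * ||w||).
margin_from_norm = 1.0 / (rho * norm_w)
```

Here the smallest `g_n` is 6 and `||w||` is 3, so the margin is 2 either way it is computed, which is why minimizing the rescaled squared norm maximizes the margin.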



1.2 Related work on SVM-based feature elimination 

Relating the SVM concepts in Sec. 1.1 to feature elimination algorithms, [1] proposed both a
method called "basic MFE" that eliminates the feature whose elimination preserves maximum
(positive) margin in the reduced space, and a second, analogous method called "MFE-slack" that
eliminates the feature whose elimination yields the smallest SVM objective function (4) in the
reduced space. That is, in accordance with the discussion above in Sec. 1.1, the theoretical basis
for 'basic MFE' is strict margin maximization, and the theoretical basis for MFE-slack is the
generalization of strict margin maximization, with the two methods being analogous, as described
above and in [1].

To now re-state these two feature elimination methods presented in [1], let $\mathcal{R}$ denote the retained
feature set, which is initially the full set of available features and becomes smaller
as features are eliminated, and let $(\mathbf{w}^{\mathcal{R}}, w_0^{\mathcal{R}})$ denote the associated SVM weights.
Let $q^{-m}$ (or alternately $q^{-m,n_a}$) denote quantity q upon (candidate or actual) elimination of feature
m, with $n_a$ additionally indicating a choice of "anchor sample" discussed below. As given
by [1], the feature $m^*$ eliminated by 'basic MFE' is

$$m^* = \arg\max_{m \in \{m \in \mathcal{R} \,\mid\, g_l^{-m} > 0\ \forall l \in \mathcal{N}\}} \ \min_n \frac{g_n^{-m}}{\|\mathbf{w}^{-m}\|} \tag{5}$$



^3 Note that $\mathbf{w}$ and $w_0$ are a part of constraints (3) and (4) even though they do not appear there.



and, similarly, the feature $m^*$ eliminated by MFE-Slack (when paired with "anchor sample" $n^*$
discussed next) is

$$(m^*, n^*) = \arg\min_{m \in \mathcal{R}} \ \min_{n_a \in \{l \in \mathcal{N} \,\mid\, g_l^{-m} > 0\}} \ \tfrac{1}{2}\|\mathbf{w}^{-m}\|^2 (\rho^{-m,n_a})^2 + C \sum_n \xi_n^{-m,n_a} \tag{6}$$

As described in [1], this is a discrete optimization wherein, for each candidate feature for elimination,
m, every (correctly classified) sample is evaluated as the potential margin-setting sample $n_a$
(dubbed "anchor sample"), with both the weight vector 2-norm and the slackness values evaluated
post-feature-elimination, and subsequently the optimal (feature, sample) pair $(m^*, n^*)$ that,
post-elimination, minimizes (4) over all the discrete choices $\{(m, n_a)\}$ is selected, identifying the feature
$m^*$ to be eliminated.
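To make the two discrete criteria (5) and (6) concrete, the following is a minimal sketch for the linear kernel case. It assumes that `X` and `w` are already restricted to the currently retained features, so a candidate feature is simply a column index; the function names and toy data are hypothetical, and this is an illustration of the criteria as stated above rather than the implementation of [1].

```python
import math

def reduced_quantities(X, y, w, w0, m):
    """g_n^{-m} for every sample and ||w^{-m}|| when feature m is dropped
    and all other weights are left unchanged."""
    g = []
    for xn, yn in zip(X, y):
        f_reduced = sum(wi * xi for i, (wi, xi) in enumerate(zip(w, xn)) if i != m) + w0
        g.append(yn * f_reduced)
    norm = math.sqrt(sum(wi * wi for i, wi in enumerate(w) if i != m))
    return g, norm

def basic_mfe(X, y, w, w0):
    """Criterion (5): among eliminations that keep the data separated,
    pick the one leaving the largest margin."""
    best_m, best_margin = None, -math.inf
    for m in range(len(w)):
        g, norm = reduced_quantities(X, y, w, w0, m)
        if norm == 0.0 or min(g) <= 0.0:
            continue  # not a positive-margin separator in the reduced space
        margin = min(g) / norm
        if margin > best_margin:
            best_m, best_margin = m, margin
    return best_m, best_margin

def mfe_slack(X, y, w, w0, C):
    """Criterion (6): search over (feature, anchor sample) pairs."""
    best_m, best_na, best_obj = None, None, math.inf
    for m in range(len(w)):
        g, norm = reduced_quantities(X, y, w, w0, m)
        for na, g_anchor in enumerate(g):
            if g_anchor <= 0.0:
                continue  # only correctly classified samples can be anchors
            rho = 1.0 / g_anchor  # makes the anchor's rescaled g equal to 1
            slack = sum(max(0.0, 1.0 - rho * gn) for gn in g)
            obj = 0.5 * norm ** 2 * rho ** 2 + C * slack
            if obj < best_obj:
                best_m, best_na, best_obj = m, na, obj
    return best_m, best_na, best_obj

# Hypothetical toy data: feature 0 is informative, feature 1 is nearly noise.
X = [[2.0, 0.1], [3.0, -0.1], [-2.0, 0.2], [-3.0, -0.2]]
y = [1, 1, -1, -1]
w, w0 = [1.0, 0.5], 0.0

m_basic, margin_basic = basic_mfe(X, y, w, w0)
m_slack, anchor_slack, obj_slack = mfe_slack(X, y, w, w0, C=1.0)
```

On this toy data both criteria eliminate feature 1: dropping feature 0 would leave a reduced-space boundary that misclassifies two samples, so it is excluded by (5) and heavily penalized by (6).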

Note that 'basic MFE' can only be used when data is separable, whereas its counterpart MFE-slack
is not only consistent with margin maximization central to SVM learning but also deals with
the case of nonseparable data by incorporating margin slackness into the elimination criterion as
done by the standard SVM formulation (4). There is also the "hybrid" method - introduced in
[1] and referred to herein as MFEhybrid - which is defined such that 'basic MFE' is used at the
elimination steps where the data is separable and MFE-Slack is used at the other steps (when data is
nonseparable).

The third feature elimination approach proposed in [1] is the LO (Little Optimization) approach,
an MFE variant which, like 'basic MFE', requires the data to be separable upon candidate
elimination of a feature. The premise of LO is that one can employ a type of classifier retraining
with little (exceptionally modest) computation to obtain a margin guaranteed to be at least as large as the
margin that 'basic MFE' can obtain alone. Specifically, given SVM weights $(\mathbf{w}, w_0)$ in the reduced
space (upon candidate elimination of a feature), LO considers the new parameterized weight vector
$(A\mathbf{w}, w_0)$, where A and $w_0$ are scalar parameters to be optimized, with $\mathbf{w}$ held fixed. That is,
allowing adjustment of the squared weight vector 2-norm and the intercept $w_0$ (with the weight vector
orientation fixed), LO poses the standard SVM training problem but optimizes only in this
two-dimensional $(A, w_0)$ parameter space, stated as (7) for the linear kernel case and as (8) for the
nonlinear kernel case [1]:

$$\min_{A, w_0} A^2 \quad \text{s.t.} \quad y_n \big( A(\mathbf{w}^T\mathbf{x}_n) + w_0 \big) \ge 1,\ \forall n \in \mathcal{N} \tag{7}$$

$$\min_{A, w_0} A^2 \quad \text{s.t.} \quad y_n \Big( A \Big( \sum_{k \in \mathcal{S}} \lambda_{s_k} y_{s_k} K(\mathbf{s}_k, \mathbf{x}_n) \Big) + w_0 \Big) \ge 1,\ \forall n \in \mathcal{N} \tag{8}$$

It was discussed in [1] that one can perform candidate elimination of each feature and then perform
LO in the reduced space, so as to then pick the candidate elimination that leads to the largest
post-LO margin. This elimination method, where LO is embedded into the elimination decision, is
referred to as MFE-LOemb herein.
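Because $\mathbf{w}$ is held fixed, problem (7) only sees the data through the scalar projections $\mathbf{w}^T\mathbf{x}_n$. The sketch below solves it in closed form under a stated assumption: the projections are separable with every positive-class projection larger than every negative-class one (so the optimal A is positive). The closed form is derived here for illustration and is not quoted from [1]; the names `little_optimization`, `z`, and the toy values are hypothetical.

```python
def little_optimization(z, y):
    """Solve min A^2 s.t. y_n * (A * z_n + w0) >= 1 for scalar projections z_n.

    Assumes min(z over class +1) > max(z over class -1). The two tightest
    constraints then give A * (z_pos - z_neg) >= 2, so the smallest feasible A
    is 2 / gap, with w0 placing the threshold midway between the classes.
    """
    z_pos = min(zn for zn, yn in zip(z, y) if yn == 1)
    z_neg = max(zn for zn, yn in zip(z, y) if yn == -1)
    gap = z_pos - z_neg
    if gap <= 0:
        raise ValueError("projections are not separable in the assumed orientation")
    A = 2.0 / gap
    w0 = -A * (z_pos + z_neg) / 2.0
    return A, w0

# Hypothetical projections w^T x_n in the reduced space, with labels.
z = [1.0, 4.0, -3.0, -5.0]
y = [1, 1, -1, -1]
A, w0 = little_optimization(z, y)
constraint_values = [yn * (A * zn + w0) for zn, yn in zip(z, y)]
```

The post-LO margin in the original scale is $1/(A\|\mathbf{w}\|)$, i.e. half the projection gap divided by $\|\mathbf{w}\|$; the same routine applies to (8) once the projections are computed through the kernel expansion.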

Next we propose a new feature elimination method in Sec. 2, where we also discuss
limitations of the above earlier MFE methods as motivation for our proposed method. Based on
the background provided by Sec. 2, we then propose a second method in Sec. 3 which performs
better.



2 MFE-QP: new feature elimination method 

Given SVM linear weights $(\mathbf{w}, w_0)$, we consider the new parameterized weight vector $(a\mathbf{w}, b)$, to
jointly optimize the scalar parameters a, b, and N slacknesses, with $\mathbf{w}$ held fixed. We thus pose

^4 To arrive at (6) from (4) under such consideration (i.e., upon candidate elimination of feature m and
in accordance with a particular choice of anchor sample $n_a$), one calculates the value $\rho^{-m,n_a}$ as $1/g_{n_a}^{-m}$
(so that $\rho^{-m,n_a} g_{n_a}^{-m}$ will be 1, as discussed by the $\rho$ discussion in Sec. 1.1), based on which the square of
the weight vector 2-norm and slackness values are then measured and plugged into the objective function
(4) (i.e. the squared 2-norm measurement $\|\mathbf{w}^{-m}\|^2(\rho^{-m,n_a})^2$ shown in (6) and the slackness measurement
$\max(0,\ 1 - \rho^{-m,n_a} g_n^{-m})\ \forall n$, which is the $\xi_n^{-m,n_a}$ shown in (6)).



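the standard slackness-incorporating SVM problem (4) consistent with margin maximization but
only optimize in this $(a, b, \boldsymbol{\xi})$ parameter space:

$$\min_{a, b, \boldsymbol{\xi}} \tfrac{1}{2} a^2 \|\mathbf{w}\|^2 + C \sum_n \xi_n \quad \text{s.t.} \quad \xi_n \ge 0,\ y_n(a\mathbf{w}^T\mathbf{x}_n + b) \ge 1 - \xi_n,\ \forall n \in \mathcal{N} \tag{9}$$

For the nonlinear kernel case, where $\mathbf{w}$ need not be explicitly defined as discussed in Sec. 1.1, the
problem (9) would be written equivalently as (10); i.e., the sum $\mathbf{w}^T\mathbf{x}_n$ is written as $\sum_{k \in \mathcal{S}} \lambda_{s_k} y_{s_k} K(\mathbf{s}_k, \mathbf{x}_n)$:

$$\min_{a, b, \boldsymbol{\xi}} \tfrac{1}{2} a^2 \|\mathbf{w}\|^2 + C \sum_n \xi_n \quad \text{s.t.} \quad \xi_n \ge 0,\ y_n \Big( a \Big( \sum_{k \in \mathcal{S}} \lambda_{s_k} y_{s_k} K(\mathbf{s}_k, \mathbf{x}_n) \Big) + b \Big) \ge 1 - \xi_n,\ \forall n \in \mathcal{N} \tag{10}$$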

At first look, our formulations (9) and (10) may seem computationally costly and complicated,
but they are in fact each equivalent to a very simple problem with little computational cost - the
1-dimensional SVM - as follows. Since $\mathbf{w}$ is an input held fixed (i.e. it is not one of the parameters
optimized by the problem) and $\|\mathbf{w}\|$ is only a fixed scalar (and not one of the parameters optimized
by the problem), we can define a new (scalar) parameter $w = a\|\mathbf{w}\|$ as a replacement for parameter
a, and new (scalar) data variables $z_n = \mathbf{w}^T\mathbf{x}_n / \|\mathbf{w}\|$ as replacement for data variables $\mathbf{x}_n$^5. With this
change of variables, each of (9) and (10) can be rewritten as follows.

$$\min_{w, b, \boldsymbol{\xi}} \tfrac{1}{2} w^2 + C \sum_n \xi_n \quad \text{s.t.} \quad \xi_n \ge 0,\ y_n(w z_n + b) \ge 1 - \xi_n,\ \forall n \in \mathcal{N} \tag{11}$$
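The change of variables can be sanity-checked numerically: for any $(a, b)$, the objective of (9) with each slack set to its smallest feasible value equals the objective of (11) evaluated at $w = a\|\mathbf{w}\|$ with the projected data $z_n$. A minimal sketch for the linear case follows; the function names and toy numbers are hypothetical.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def objective_9(a, b, w_vec, X, y, C):
    """Objective of (9) with each slack at its smallest feasible value
    xi_n = max(0, 1 - y_n * (a * w^T x_n + b))."""
    slack = sum(max(0.0, 1.0 - yn * (a * dot(w_vec, xn) + b)) for xn, yn in zip(X, y))
    return 0.5 * a * a * dot(w_vec, w_vec) + C * slack

def objective_11(w_scalar, b, z, y, C):
    """Objective of the 1d problem (11), same treatment of the slacks."""
    slack = sum(max(0.0, 1.0 - yn * (w_scalar * zn + b)) for zn, yn in zip(z, y))
    return 0.5 * w_scalar ** 2 + C * slack

# Hypothetical fixed weight vector and toy data.
w_vec = [3.0, 4.0]
X = [[1.0, 0.5], [0.2, -0.1], [-1.0, 0.0], [-0.3, -0.4]]
y = [1, 1, -1, -1]
C = 2.0

norm_w = math.sqrt(dot(w_vec, w_vec))
z = [dot(w_vec, xn) / norm_w for xn in X]  # scalar data variables z_n

a, b = 0.7, -0.2  # an arbitrary point in the (a, b) parameter space
value_9 = objective_9(a, b, w_vec, X, y, C)
value_11 = objective_11(a * norm_w, b, z, y, C)
```

Since the two objectives agree at every $(a, b)$, minimizing one is the same as minimizing the other, which is what allows the elimination criterion to be evaluated with a scalar solver.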

Notice that (11) is simply a 1d SVM; the scalars w and b together define the 1d SVM decision
boundary, i.e. a 1d threshold. Our proposed approach of (9) and (10) - subsequently equivalently
written as (11) - is named "QP". Since the QP approach requires very little computation, it can be
performed in conjunction with each feature elimination step. Specifically, given a current retained
feature set $\mathcal{R}$, we can perform candidate elimination of each feature $m \in \mathcal{R}$, train the 1d SVM
(11) in the reduced space (i.e. under candidate retained feature set $\mathcal{R} \setminus \{m\}$), and then pick the
candidate elimination that leads to the smallest SVM objective function value (11) in the reduced
space. This embedding of QP into the elimination decision - consistent with margin maximization
central to SVM learning - defines our proposed feature elimination method, named MFE-QPemb,
given by (12) below; i.e. the feature $m^*$ proposed to be eliminated is:

$$m^* = \arg\min_{m \in \mathcal{R}} \ \min_{w, b, \boldsymbol{\xi}} \tfrac{1}{2} w^2 + C \sum_n \xi_n \quad \text{s.t.} \quad \xi_n \ge 0,\ y_n(w z_n^{-m} + b) \ge 1 - \xi_n,\ \forall n \in \mathcal{N} \tag{12}$$

Notice that the only computation required by this method is to - for each candidate feature m
for elimination - compute the 1d (scalar) data points $z_n^{-m}$ ($\forall n \in \mathcal{N}$) from training points $\mathbf{x}_n$ (using
recursion, to reduce computation), then train a 1d SVM with these scalar training points $z_n^{-m}$ to
obtain the $(w, b, \boldsymbol{\xi})$ triplet, and lastly plug these obtained $(w, b, \boldsymbol{\xi})$ values into (12) to determine the
winning candidate $m^*$.

Rather than solving the 1d SVM (11) in the usual way (by using a "dual" solver), an SVM
training algorithm specifically designed for the 1d case for significantly reduced computation can
be utilized; this algorithm with $O(n \log n)$ complexity (the complexity of sorting n numbers), given
in the Appendix, is only for computation reduction for the 1d SVM training step and is thus
recommended but not required per se by our proposed QP approach.
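One elimination step of MFE-QPemb for the linear kernel case can be sketched as follows. The 1d solver here is a deliberately simple stand-in and is not the $O(n \log n)$ algorithm of the Appendix: for a fixed w it scans the hinge breakpoints for the best b, and it runs a ternary search over w, which is valid because the objective minimized over b is convex in w. Reusing the full projection $\mathbf{w}^T\mathbf{x}_n$ and subtracting the candidate feature's contribution is one reading of the "recursion" mentioned above. All names and the toy data are hypothetical, and `X` and `w_vec` are assumed to be restricted to the currently retained features.

```python
import math

def hinge_objective(w, b, z, y, C):
    """Objective of (11) with each slack at its smallest feasible value."""
    return 0.5 * w * w + C * sum(max(0.0, 1.0 - yn * (w * zn + b)) for zn, yn in zip(z, y))

def best_intercept(w, z, y, C):
    """For fixed w the objective is convex and piecewise linear in b, so a
    minimizer lies at one of the breakpoints b = y_n - w * z_n."""
    candidates = [yn - w * zn for zn, yn in zip(z, y)]
    return min(candidates, key=lambda b: hinge_objective(w, b, z, y, C))

def train_1d_svm(z, y, C, iters=200):
    """Simple (not O(n log n)) solver for the 1d SVM; returns (w, b, objective)."""
    def profile(w):
        return hinge_objective(w, best_intercept(w, z, y, C), z, y, C)
    # At the optimum, 0.5 * w^2 <= objective at (w=0, b=0) = C * N.
    hi = math.sqrt(2.0 * C * len(z))
    lo = -hi
    for _ in range(iters):  # ternary search on a convex function of w
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if profile(m1) <= profile(m2):
            hi = m2
        else:
            lo = m1
    w = 0.5 * (lo + hi)
    b = best_intercept(w, z, y, C)
    return w, b, hinge_objective(w, b, z, y, C)

def mfe_qp_emb_step(X, y, w_vec, C):
    """Criterion (12): return (m*, objective) over candidate eliminations."""
    proj = [sum(wi * xi for wi, xi in zip(w_vec, xn)) for xn in X]
    norm_sq = sum(wi * wi for wi in w_vec)
    best_m, best_obj = None, math.inf
    for m, wm in enumerate(w_vec):
        reduced_sq = norm_sq - wm * wm
        if reduced_sq <= 0.0:
            continue  # no weight left in the reduced space
        norm = math.sqrt(reduced_sq)
        # z_n^{-m}: reuse the full projection, subtract feature m's share.
        z = [(p - wm * xn[m]) / norm for p, xn in zip(proj, X)]
        _, _, obj = train_1d_svm(z, y, C)
        if obj < best_obj:
            best_m, best_obj = m, obj
    return best_m, best_obj

# Hypothetical toy data: feature 0 is informative, feature 1 is nearly noise.
X = [[2.0, 0.1], [3.0, -0.1], [-2.0, 0.2], [-3.0, -0.2]]
y = [1, 1, -1, -1]
m_star, obj_star = mfe_qp_emb_step(X, y, [1.0, 0.5], C=1.0)
w_1d, b_1d, obj_1d = train_1d_svm([-2.0, -1.0, 1.0, 2.0], [-1, -1, 1, 1], C=100.0)
```

Each candidate costs one scalar projection update plus one 1d training run, regardless of how many features are currently retained; a sort-based solver would reduce the per-candidate cost further.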

As discussed in Sec. 1.2, a previous method with properties similar to our proposed method
is MFE-Slack [1]. MFE-Slack is similar in that it performs SVM feature elimination that is both
consistent with margin maximization and incorporates margin slackness. However, a limitation
of MFE-Slack is that it makes no alteration to the decision boundary that results solely from the
^5 In the nonlinear kernel case, we define $z_n$ instead as $z_n = \sum_{k \in \mathcal{S}} \lambda_{s_k} y_{s_k} K(\mathbf{s}_k, \mathbf{x}_n) / \|\mathbf{w}\|$.



removal of the eliminated feature^6; it is based on discrete optimization, whereas our proposed
approach QP performs non-discrete optimization - similarly with only little computational cost, in
the form of a 1d SVM - altering the decision boundary so as to obtain a lower, more optimal SVM
objective function value than MFE-Slack. That is, essentially our proposed approach seeks to
produce modestly more margin maximization than MFE-Slack at each feature elimination step, within
the standard slackness-based SVM nonseparability framework (4) that both methods are based on.
For illustrative purposes, in Fig. 1, on the 2000-feature Colon Cancer gene dataset representative
of high-d datasets, the average test set classification error rate (an average across "trials"^7) is plotted
as a function of the number of retained features (which is reduced going from right to left), for
the nonlinear Gaussian kernel as well as the polynomial kernel^8. The Figure illustrates that our
proposed method MFE-QPemb achieved much higher generalization accuracy than MFE-Slack and
MFEhybrid. The other curves in the plot will be discussed shortly.

Another motivation for our proposed QP approach is that it makes it possible to avoid "full (SVM)
retraining" ("FR", the recalculation of all SVM weights from scratch) at individual feature
elimination steps; i.e., instead of full SVM retraining, one can perform QP, which similarly is a type of
classifier retraining^9. Avoiding FR for high-d datasets is important for several reasons. First, if
FR is performed stepwise (for the sake of stepwise producing optimal SVM quantities, i.e. higher
margin, or smaller SVM objective function value, than obtained by the above methods that do
not employ FR), the overall feature selection process may, as we will evaluate herein, suffer from
"overfitting", i.e. the "overlearning" of data, which is a limitation. This is especially an issue
for high-dimensional data, since overfitting - a cumulative effect - is expected when the number of
accumulated feature elimination steps is very large. To illustrate this, also included in Fig. 1 are
results for MFE-Slack-FRsub (a method where FR is performed at each feature elimination step
subsequent to the feature elimination decision by MFE-Slack) and similarly MFEhybrid-FRsub
(i.e. MFEhybrid's FR-based counterpart). Notice from the results of these two FR-based
methods that, while the use of FR has brought an increase in generalization accuracy, these
methods, which employ stepwise full SVM retraining throughout a very large number of elimination
steps, achieve less generalization accuracy than our proposed method MFE-QPemb (which has
the additional benefit of lower computational cost, as we will discuss next); this result can
perhaps be understood as a type of overfitting. To illustrate that there may not be much overfitting
for - by contrast - a relatively low-dimensional dataset, we give Fig. 2, where we demonstrate that
the above FR-based methods now outperform our proposed method MFE-QPemb, unlike
in the high-d case of Fig. 1 above^10. Second, FR is computationally much more costly than our
proposed QP approach. If the initial feature dimensionality is M (e.g. 7000 or more, for gene data),
at the i-th elimination step FR will need to train an SVM for a very large data dimensionality $M - i$
(such as 6999, 6998, ...), whereas our proposed method only trains a 1d SVM, i.e. the number of
dimensions is 1 throughout elimination steps.

Moreover, because it incorporates slackness, QP does not require the data to be separable at
a feature elimination step. By contrast, the previously proposed similar MFE-LOemb method
(based on the similar LO approach), described in Sec. 1.2 above, requires separability. Requiring



^6 That is, throughout elimination steps, relative magnitudes within the set of all originally designed i) SVM
weights including the hyperplane intercept $w_0$, or equivalently, ii) Lagrange multipliers, remain unchanged.

^7 A "trial", defined in Sec. 4 and summarized here, is a random 50-50% split of the dataset into a held-out
set and a non-heldout set. To be used as initial classifier by all elimination methods, an SVM is trained
on the entire non-heldout set, using SVM hyperparameter values determined by 5-fold cross-validation on
this set. Feature elimination is then performed, with error rate measured on the held-out (test) set.

^8 Herein, in graphs we zoom in on the end (i.e. the most interesting) segment of the elimination process,
since methods' performance relative to each other does not significantly vary during the beginning segment
not shown.

^9 Note also that the use of QP does not preclude subsequent use of FR. After the feature elimination
decision is made using QP, one can opt to use, or not use, full SVM retraining subsequently, before proceeding
to eliminate another feature.

^10 In this particular comparison, MFE-QPemb serves a reference role due to being present in both (low-d
and high-d) Figures.
Figure 1: Average test set classification error rate for the high-d Colon Cancer gene dataset, for
(a) the Gaussian kernel and (b) the polynomial kernel.



[Figure 2 plot. Horizontal axis: number of features retained (50 to 300). Curves: MFE-Slack,
MFE-Slack-FRsub, RFE-FRsub, MFE-QPemb, MFE-QPemb-FRsub, BMFE-QPemb, MFE-LOemb, MFE-LOemb-FRsub.]

Figure 2: Average test set classification error rate for the somewhat high-dimensional UCI Arcene
dataset (where we only used 300 initial features of this dataset), for the Gaussian kernel.

separability, i.e. strictly satisfying the margin, may lead to overfitting when training samples at the
margin are outliers or even mislabeled samples [1]. For illustrative purposes, Fig. 1 also includes
results for the MFE-LOemb method and its FR-based counterpart MFE-LOemb-FRsub (which
performs FR at each feature elimination step subsequent to the feature elimination decision by
MFE-LOemb). Notice that both of these LO-based methods are outperformed by our proposed
method MFE-QPemb, which does not utilize FR, for both nonlinear kernels used. Again, the results
presented in this Section serve illustrative purposes; more extensive evaluations are given in
the Results section.

Notice that hyperparameter C is a part of our proposed method's elimination criterion (12), and
thus hyperparameter tuning can be incorporated into the method's elimination decision. Herein we
do not do so; i.e. the only hyperparameter tuning we perform is cross-validation during pre-elimination to
obtain (train) the initial SVM in the original full feature space (4). In future work, tuning
additionally during elimination may increase the generalization accuracy of our proposed QP
approach, without bringing an unacceptable increase in complexity (both computation
and memory storage), especially if this tuning is solely batch-performed, i.e. upon each elimination
of a batch of features rather than each feature. By contrast, the LO approach - by nature of its
definition, which does not include C or a similar hyperparameter - does not allow incorporating
hyperparameter tuning into the elimination decision; this is another limitation of the LO approach.
Herein we will show that our proposed approach QP achieves higher or competitive generalization
accuracy (lower or similar test set classification error rate) compared with LO despite
not utilizing the option to tune hyperparameters during elimination.

Above we described and illustrated - for SVM-based "backward feature elimination" - that
adjusting the current solution in the reduced feature space (upon candidate elimination of a feature),
by either a light retraining of the classifier (via e.g. QP or LO) or full SVM retraining (FR), brought
an increase to generalization accuracy; i.e. performed better than 'basic MFE' (or MFEhybrid) and
MFE-slack. Clearly, there is a discrepancy between this finding and [1]'s earlier finding that stepwise
classifier retraining did not improve accuracy of the MFE-based feature elimination methods. The
explanation is simple: not for the first, non-retraining group of methods ('basic MFE', MFE-slack,
and their hybrid MFEhybrid) in [1] but rather only for the second, retraining group of methods
in [1] ("MFE-Retrain" (aka MFEhybrid-FR), and MFE-LO), it turns out that test set accuracy
evaluation was performed less properly in [1] than herein. Specifically, in [1], when the elimination
sequence - after its entirety was obtained - was applied to the test set (i.e. a retroactive application
of a complete elimination sequence to a test set), the test set's features were removed in accordance
with this correct sequence, while each stepwise boundary, used to measure test set error rate, was
the boundary that results solely from removing that step's eliminated feature. By contrast, herein
the test set error rate was measured more properly, on post-retraining boundaries propagated
across the steps of the elimination process that is being applied to the training set. In other words,
essentially the results we present here for "full (SVM) retraining (FR)" and LO [1] are more proper
results than the results presented for these two approaches earlier in [1].



3 BMFE: feature elimination that combines "expected 
risk" bound and margin maximization 



The second novel feature elimination method we propose herein, named BMFE-QP, also intended
primarily for use with high-d data, is based on combining the premise of a well-known theoretically
motivated upper bound (by Vapnik et al.) on "expected risk" (of making a classification error)
in Statistical Learning Theory [23], [12] with the premise of the SVM-based QP feature elimination
approach we proposed in Sec. 2. We begin with an introductory discussion on the bound in Sec.
3.1 and discuss our proposed method in Sec. 3.2.

3.1 Background on "expected risk" bound 

Statistical Learning Theory [22], [23], [12] is concerned with producing a function $f: \mathcal{X} \to \mathcal{Y}$ that
estimates from an input $\mathbf{x} \in \mathcal{X}$ an output $y \in \mathcal{Y}$ considered to be the "truth" associated with
$\mathbf{x}$. For instance, such estimation may be 2-class classification, with the sought function being
a hyperplane (i.e. the decision boundary) as herein, and $\mathbf{x}$ a vector $\mathbf{x} \in \mathbb{R}^M$ with class label
$y \in \{\pm 1\}$. More precisely, a "learning machine" is defined by a family of possible mappings
$\mathbf{x} \mapsto f(\mathbf{x}, \Theta)$ parameterized by $\Theta$, with "learning" (aka training) being the step of employing a
"training set" of N input samples in order to choose a particular value of $\Theta$, whereby one is simply
left with the desired mapping $\mathbf{x} \mapsto f(\mathbf{x})$. The "hyperparameters" are the tuning parameters
whose values are commonly set (tuned) by a validation procedure such as cross-validation (used in
our work herein) [3], [12]; e.g., in the SVM linear kernel case, the only hyperparameter is C in (4).

The VC dimension h is a property of a family of functions parameterized by $\Theta$; it is a measure
of the learning machine's "capacity", which refers to its ability to learn any training set without
error [3], [12]^11, which, in our work herein on classification, means no classification errors. For a
learning machine with VC dimension h, the bound (13) on "expected risk" $R(\Theta)$ holds with (high)
probability $1 - \eta$ (where $0 < \eta < 1$ can be chosen arbitrarily small) [12], [22].

$$R(\Theta) \le R_{emp}(\Theta) + \sqrt{\frac{h\big(\log(2N/h) + 1\big) - \log(\eta/4)}{N}} \tag{13}$$

$R_{emp}(\Theta)$ in (13), given by $\frac{1}{2N}\sum_n |y_n - f(\mathbf{x}_n, \Theta)|$^12, is the "empirical risk", the mean error rate
measured on the training set. Since $R_{emp}(\Theta)$ is a fixed number for a given choice of training
set $\{(\mathbf{x}_n, y_n)\}$ and the $\Theta$ chosen by the training, the right-hand side of (13) can be calculated if
one knows h [3]. A principled way of choosing a learning machine is to choose from among a set
of candidates the one with the lowest value for the right-hand side of (13) [3], [12]. To minimize
the right-hand side, there is a tradeoff between minimizing the first term (the training error) and

^11 VC dimension examples, including examples discussing it in the context of a function family's number
of parameters, can be found in [3].

^12 This particular way of writing $R_{emp}(\Theta)$ assumes the loss function $L(u, v)$ is 0 if $u = v$ and is 1 otherwise;
e.g., if $\mathcal{Y} = \{\pm 1\}$ (used commonly and in our work herein), $L(u, v)$ is $\frac{1}{2}|v - u|$.


the second term (called the "VC confidence (VCC)" term) depending on the complexity of the
machine through the machine capacity measure $h$. The candidates can be evaluated according to
the Structured Risk Minimization (SRM) induction principle, defined to introduce a "structure"
into the candidate set $F^{cands}$ by considering nested subsets $F_1 \subset F_2 \subset \ldots \subset F_Z$ (for some integer
$Z$, with $F_Z \subseteq F^{cands}$) with associated known VC dimensions $h_1 < h_2 < \ldots < h_Z$ (or known bounds
on the VC dimensions). In this way, for a given subset $F_i$, the goal would be to minimize $R_{emp}$
among members of that subset. Accordingly, upon training a number of machines across subsets
(possibly as few as simply one machine per subset), one can simply choose the machine with the
least sum of $R_{emp}$ and VCC. SVM was "one of the first practical learning procedures for which
useful bounds on the VC dimension could be obtained and hence the SRM program could be carried
out" [12]. Specifically, SVM produces a hyperplane decision function $f_{w,b} = \mathrm{sign}(w \cdot x + b)$ defined
on training points (or their basis expansions, for the nonlinear kernel case) contained in a ball of
radius $r$, and, for some scalar $A$, the VC dimension $h$ of the function family $\{f_{w,b} : \|w\| \le A\}$
has the following bound [12, 22]:

$$h \le r^2 A^2 + 1 \qquad (14)$$

That is, as is well known, the standard SVM formulation (4) is a specific implementation - an
instantiation - of the SRM principle for minimizing the (generic) bound (13): $C$ serving to achieve a
tradeoff between $\sum_n \xi_n$ and $\|w\|^2$ in (4) is an instantiation of seeking the tradeoff discussed above
between the two terms of (13). It is, however, important to recognize that (13) and (4) implement
"tradeoff" in their own mathematically distinct ways; i.e. (13) is the generic approach, from which
differs the specific (and mathematically distinct) (4). To see this, first notice that unlike $C$ in (4)
there is no explicit dedicated tradeoff (i.e. tuning) parameter in (13). Second, the VCC term of (13)
and the $\|w\|^2$ term of (4) are mathematically quite different; e.g. (4) does not explicitly include
the data radius $r$ whereas (13) essentially does via (14) (because (14) can be used to rewrite (13)
as (15) by simply plugging in the upper bound $r^2\|w\|^2$ for $h$).



$$R(\Theta) \le R_{emp}(\Theta) + \sqrt{\frac{r^2\|w\|^2\left(\log\left(2N/(r^2\|w\|^2)\right) + 1\right) - \log(\eta/4)}{N}} \qquad (15)$$



Overall the bound (13) is simply a guide - among two machines achieving zero empirical risk
(i.e. no classification errors in training set), it is possible for the one with the higher VC dimension



^^This function is canonical, i.e. there exists a "margin-setter" sample $x$ for which $y(w \cdot x + b)$ is 1, as
stated in the constraints of the standard SVM formulation.

^^One way the ball radius $r$ can be defined is to define it as the maximum Euclidean distance between any
two (training) points; i.e. $\max_{i,j} \|x_i - x_j\|$.

^^Note, however, that the ball around the data points means that this approach to bounding the VC
dimension depends on observed values of the features; i.e. as noted by [12] (p. 389) "in a strict sense, the
VC complexity of the class is not fixed a priori, before seeing the features". It is, however, possible to gain
insight as to why maximizing the margin, the SVM objective (via the standard slackness-based formulation
(4)), is considered important for good generalization performance. To that end, note the family of SVM-like
classifiers named "gap tolerant classifiers" discussed in [3], which are based on both the idea of putting balls
around data points and hyperplanes - the classifier is specified by the location of a ball (in $\mathbb{R}^d$) and two
parallel hyperplanes with parallel normal vectors (in $\mathbb{R}^d$), and the decision function classifies points as one
of two classes so long as these points lie inside the ball but not between the hyperplanes (i.e., so long as the
points are not members of the so-called "margin set" of points that may lie between the hyperplanes). The
VC dimension of such a family of classifiers can be controlled by controlling both the maximum allowed ball
diameter and the minimum allowed perpendicular distance between the two hyperplanes [3]; subsequently
discussing along this line with examples, [3] argues that it seems very reasonable to conclude that SVMs
too gain a similar kind of capacity control from their training due to their training objectives being very
similar to gap tolerant classifiers'. The discussion above suggests that although a rigorous explanation for
why SVMs often achieve good generalization performance is not provided by SRM alone, there is clearly a
theoretical connection between SVMs' objective of margin maximization and SRM.

^^Notice that the sum $\sum_n \xi_n$ in (4) corresponds to (13)'s $R_{emp}$; it is an upper bound on $R_{emp}$ (the number
of classification errors), because, by definition of slackness, 1) each term in the sum is nonnegative, and 2)
the sum contains a term for each classification error (with each such term > 1).



to achieve better generalization performance. Next we introduce a novel feature elimination
method motivated by and consistent with the above theoretical details.



3.2 BMFE-QP: second new feature elimination method 

In Sec. 3.1 we discussed the risk bound (13) and the theoretical connection between the SRM
principle (for minimizing this bound) and the standard SVM formulation (4). Now let us relate
this to feature elimination algorithms. In particular, we would like to craft a feature elimination
criterion based on the bound and consistent with the well-established theory.

One feature elimination approach (the immediately obvious one) would be to eliminate the
particular feature $m^*$ whose elimination minimizes the bound (13):



Although this feature elimination criterion (16) is consistent with theory, it is not crafted
very well, as follows. Suppose for a candidate feature $m$ for elimination we obtain the quantities
$R_{emp}^{-m}$ and $\|w^{-m}\|^2$ via SVM and plug them into (15) to evaluate the bound. That is, upon
having already tuned a good tradeoff between these quantities through the tradeoff parameter $C$
for specifically the SVM machine context (without explicit use of data radius $r$), we would now
be somewhat abandoning that tradeoff in favor of making an alternate tradeoff attempt via direct
bound minimization (16) for the generic machine context which, by contrast, 1) makes explicit use
of data radius $r$ and 2) makes no use of $C$. Clearly, from a numerical optimization standpoint, this is
essentially an "apples and oranges" situation, because the tradeoff suitability of these two particular
values as a pair comes from having been numerically fused into a single measurement through
optimization (i.e. the optimized SVM objective function measurement $\frac{1}{2}\|w^{-m}\|^2 + C\sum_n \xi_n$) via the
fusing parameter $C$, whereas their use in (16) - which "unfuses" them and treats them as separate
entities yet under their fusion-derived values - actually means both 1) use of these two quantities in
a now numerically mismatched context and 2) loss of information obtained for hyperplane-specific
tradeoff. It may be wiser to instead preserve the tradeoff that had already been achieved by the
SVM, since that tradeoff is for the particular decision boundary type we are seeking in the reduced
space: a hyperplane. That is, it may be wiser to craft a feature elimination criterion that combines
the bound concept with the whole (intact) form of the SVM objective function; i.e., we shall treat
the SVM objective function as a single, intact quantity, rather than separate its key quantities
$R_{emp}^{-m}$ and $\|w^{-m}\|^2$ as seen in (16). Thus we propose as feature elimination criterion the product
of the data radius squared and the SVM objective function; here the particular SVM objective
function we propose for the criterion is that of the 1d SVM proposed by our QP approach (11), because
1) as illustrated in Sec. 2 and as will be shown in the Results section, among all previous MFE-based
elimination approaches that incorporate slackness to support nonseparability the one that
achieves the best generalization is our proposed QP approach, and 2) QP has little computational
complexity since it merely involves 1d SVM training.

$$m^* = \arg\min_{m \in \mathcal{R}} \; r^2 \min_{w,b,\xi} \left( \tfrac{1}{2} w^2 + C \sum_n \xi_n \right) \quad \text{s.t.}\ \ \xi_n \ge 0,\ \ y_n (w z_n + b) \ge 1 - \xi_n,\ \ \forall n \qquad (17)$$

Our proposed feature elimination method (17), named BMFE-QPemb, is consistent with theory.
At an elimination step, when the data is separable (under candidate feature elimination), the first
term $R_{emp}$ of the bound (13) is 0, and thus our proposed method (17) reduces to minimizing the
(theoretically motivated) bound (13) itself; i.e., our method, by minimizing $r^2\|w\|^2$, essentially



"'^'''As an aside, note that the bound is guaranteed not tight when the VCC, which monotonically increases 
with h, exceeds a threshold (which, clearly, would be relative to the maximum value chosen for the loss 
function); e.g., for the 0-1 loss above, a reasonable threshold is 1. For example, for this case, [3] plotted
VCC versus $h/N$ for N = 10,000 and $\eta = 0.05$ as well as illustrated that VCC exceeds 1 for $h/N > 0.37$.
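
The threshold quoted in this footnote can be checked numerically. The short Python sketch below evaluates the VC confidence term of the bound (13) for the same setting (N = 10,000, $\eta$ = 0.05); the function name `vc_confidence` is a hypothetical helper introduced only for this illustration.

```python
import math

def vc_confidence(h, n_samples, eta):
    """Second term of the risk bound: sqrt((h (log(2N/h) + 1) - log(eta/4)) / N)."""
    return math.sqrt((h * (math.log(2 * n_samples / h) + 1) - math.log(eta / 4)) / n_samples)

N, ETA = 10_000, 0.05
below = vc_confidence(0.37 * N, N, ETA)   # h/N = 0.37
above = vc_confidence(0.38 * N, N, ETA)   # h/N = 0.38
```

Under these assumptions the term stays just below 1 at $h/N = 0.37$ and exceeds 1 at $h/N = 0.38$, in line with the figure cited from [3].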



^^Recall that $\sum_n \xi_n$ here is essentially $R_{emp}$ as described in Sec. 3.1.



minimizes $h$ due to (14), and therefore minimizes the second (and only nonzero) term in the bound
(13). If, on the other hand, at the elimination step the data is nonseparable (under candidate
feature elimination), we are simply multiplying $r^2$ with the standard nonseparability counterpart
of $\|w\|^2$ rather than $\|w\|^2$ itself; i.e. as very commonly done consistent with theory (as discussed in
Sec. 1.1), we are utilizing the standard SVM objective function $\frac{1}{2}\|w\|^2 + C\sum_n \xi_n$ as an alternative
to the standard SVM objective function $\frac{1}{2}\|w\|^2$ (where the latter strictly maximizes the margin
whereas the former allows for slackness in the margin constraints as discussed in Sec. 1.1).
It should be clear from the above that BMFE-QPemb is consistent with both margin maximization
(central to SVM learning) and the "expected risk" bound (central to the SRM principle); it is
consistent with well-established theory.

Notice in (17) that the squared data radius appears on the outside of the SVM objective function (i.e. left of the
QP expression for 1d SVM) since it only depends on the data. That is, our proposed elimination
criterion (17) only requires 1d SVM retraining (in the same precise way as our earlier QP criterion
(11)) and subsequently a multiplication (by $r^2$). That is, the criterion requires little computation,
and thus can be performed in conjunction with each feature elimination step. Fig. 4, for
the Duke Breast Cancer gene dataset with 7129 features, illustrates that a large gain in accuracy
is achievable by employing our second proposed method BMFE-QPemb instead of our first proposed
method MFE-QPemb, for both nonlinear kernels used herein.

Above we have argued that our proposed method BMFE-QPemb (17) is crafted better
than the alternative method (16), which eliminates by direct bound minimization. We name this
alternative the "BFE" method (as opposed to "BMFE") and now for illustrative purposes we couple
BFE with our QP approach and arrive at "BME-QPemb"; i.e. the particular approach we employ
in order to provide the discussed quantities $R_{emp}^{-m}$ and $\|w^{-m}\|^2$ to the BFE approach (16) is the QP
approach. Consistent with our comments for (16), the results for BME-QPemb in Fig. 1 illustrate
that this method achieves lower generalization accuracy than our proposed method BMFE-QPemb.
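
To make the contrast between the two criteria concrete, the following Python sketch scores two candidate features both ways. It is only an illustration under stated assumptions: the function names, the candidate labels `m1` and `m2`, and all numeric values are hypothetical, and the direct-bound score simply plugs $r^2\|w\|^2$ in for $h$ as in (15).

```python
import math

def bfe_score(remp, r2, w2, n_train, eta=0.05):
    """Direct bound value: empirical risk plus VC confidence, with h replaced by r^2 * ||w||^2."""
    h = r2 * w2
    vcc = math.sqrt((h * (math.log(2 * n_train / h) + 1) - math.log(eta / 4)) / n_train)
    return remp + vcc

def bmfe_score(r2, w2, slack_sum, C):
    """Squared data radius times the intact SVM objective (1/2 ||w||^2 + C * sum of slacks)."""
    return r2 * (0.5 * w2 + C * slack_sum)

# Hypothetical per-candidate quantities:
# feature -> (r^2, ||w||^2, sum of slacks, training error rate)
candidates = {
    "m1": (4.0, 1.0, 0.0, 0.0),
    "m2": (3.0, 2.0, 0.0, 0.0),
}
C, n_train = 1.0, 100

bfe = {m: bfe_score(remp, r2, w2, n_train) for m, (r2, w2, s, remp) in candidates.items()}
bmfe = {m: bmfe_score(r2, w2, s, C) for m, (r2, w2, s, remp) in candidates.items()}
bfe_choice = min(bfe, key=bfe.get)     # feature eliminated by direct bound minimization
bmfe_choice = min(bmfe, key=bmfe.get)  # feature eliminated by the product criterion
```

In this separable toy case (all slacks zero) both scores are increasing in $r^2\|w\|^2$, so they select the same feature; the two criteria can only diverge once slackness enters the objective.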

Notice that Fig. 1 lastly also illustrates results for "BMFE-Slack", where, as the name
suggests, we are defining this method by simply combining the BMFE approach (of multiplying by
square of the data radius) with the MFE-Slack criterion (instead of combining BMFE with the QP
criterion); i.e. BMFE-Slack is simply (6) with the squared data radius put into its objective function as a
multiplicative term. The results illustrate that this, unlike BMFE-QPemb, results in a generalization
performance not significantly better than that of MFE-Slack itself, which re-emphasizes that our proposed QP
criterion is much more effective than the MFE-Slack criterion.

Our BMFE approach (of multiplying the SVM-based objective function by square of the data 
radius) can be combined, in future work, with the LO criterion (instead of combining BMFE with 
the QP criterion); this approach would have the limitations of the LO approach, described above. 

4 Results and Discussion 

We performed experiments to compare our proposed methods (MFE-QPemb and BMFE-QPemb)
with every previously proposed MFE-based method [1]: 1) MFE-Slack; 2) its separability-requiring
counterpart 'basic MFE'; 3) their hybrid MFE/MFE-Slack, herein named MFEhybrid, i.e. the three
best performers among MFE-based methods; 4) MFE-LO; and their "FRsub" variants that
perform stepwise "full (SVM) retraining (FR)" subsequent to the decision to eliminate a particular
feature (as discussed in Sec. 2), e.g. MFE-Slack-FRsub; as well as 5) RFE-FRsub [11].

Our proposed methods can be compared with additional feature selection methods such as
sparse estimation methods that have been researched extensively over the past several years, such
as the state-of-the-art Lasso and $\ell_1$-based logistic regression. Although we leave this comparison

^^Note that the squared data radius itself can be computed with little computation cost using recursion.
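
One way such an update could look is sketched below in Python. This is an assumption-laden illustration rather than the recursion actually used: it takes the radius to be the maximum pairwise Euclidean distance, works in the input space (i.e. as for a linear kernel), and updates the pairwise squared distances by subtracting the removed feature's contribution instead of recomputing them from scratch. All function names are hypothetical.

```python
def pairwise_sq_dists(X):
    """Full pairwise squared Euclidean distances (computed once, before elimination starts)."""
    n = len(X)
    return [[sum((a - b) ** 2 for a, b in zip(X[i], X[j])) for j in range(n)]
            for i in range(n)]

def drop_feature(D2, X, m):
    """Squared distances after removing feature m: subtract that feature's contribution."""
    n = len(X)
    return [[D2[i][j] - (X[i][m] - X[j][m]) ** 2 for j in range(n)] for i in range(n)]

def radius_sq(D2):
    """Squared data radius, taken here as the largest pairwise squared distance."""
    return max(max(row) for row in D2)

X = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]   # toy data: 3 samples, 2 features
D2 = pairwise_sq_dists(X)
r2_full = radius_sq(D2)                          # all features retained
r2_without_0 = radius_sq(drop_feature(D2, X, 0)) # candidate: eliminate feature 0
r2_without_1 = radius_sq(drop_feature(D2, X, 1)) # candidate: eliminate feature 1
```

Each candidate elimination then costs one subtraction per sample pair, independent of how many features remain.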

^^By contrast, these two methods perform similarly in our earlier illustrative Figure, which was for the
Colon Cancer gene dataset (Fig. 1).

^^Since RFE [11] performs full (SVM) retraining (i.e. "FR") subsequent to the feature elimination decision,
herein we refer to RFE as RFE-FRsub.



for future work, we note that our proposed methods could potentially have an advantage over 
sparse estimation methods when nonlinear kernels are considered, even under our current method 
definitions herein where we do not yet employ the potential benefit of hyperparameter tuning during 
the feature elimination steps themselves, as we discussed above. For the nonlinear case, note that 
herein we provide extensive results for our proposed methods. 

The experiment procedure we used for training of the initial SVM classifier used by all feature
elimination methods evaluated herein was the following common one. The dataset is randomly
split 50-50% into a non-heldout (training) set X and a heldout (test) set. Each such split
defines one "trial". 5-fold cross-validation is then performed on the non-heldout set, to select
classifier hyperparameters for each trial, from amongst a candidate set of hyperparameter values.
The classifier is then retrained for these hyperparameter values using all of X.
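
The bookkeeping of one trial can be sketched in Python as follows. This is a minimal sketch, not the code used for the experiments: the function names are hypothetical, the split is unstratified (an assumption), and `cv_error` stands for a caller-supplied routine that trains a classifier with the given hyperparameters on one index set and returns its error rate on another.

```python
import random

def make_trial(n_samples, seed):
    """Random 50-50 split of sample indices into non-heldout and heldout sets."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    half = n_samples // 2
    return idx[:half], idx[half:]

def make_folds(train_idx, k=5):
    """Partition the non-heldout indices into k validation folds."""
    return [train_idx[i::k] for i in range(k)]

def select_hyperparameters(train_idx, candidates, cv_error, k=5):
    """Return the candidate with the lowest mean validation error over the k folds."""
    folds = make_folds(train_idx, k)

    def mean_error(params):
        total = 0.0
        for val in folds:
            held = set(val)
            fit = [i for i in train_idx if i not in held]
            total += cv_error(params, fit, val)
        return total / k

    return min(candidates, key=mean_error)

train_idx, test_idx = make_trial(44, seed=0)
# Placeholder error function for illustration only; a real one would train an SVM.
best_C = select_hyperparameters(train_idx, [0.1, 1.0, 10.0],
                                lambda C, fit, val: abs(C - 1.0))
```

After selection, the classifier would be retrained once on all of `train_idx` with the chosen values, and `test_idx` is touched only for the final error estimate.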

As discussed in [1], Cover's linear dichotomy theorem [6] states that the probability that a
training set (with points in general position) is linearly separable is very close to 1 when the
number of samples is not bigger than the number of features plus one. For example, for the gene
data and biomedical imaging data domains, typically there may be a huge number of features
(e.g. 7000 or more) but no more than a few hundred patient samples [1]; in this case, it is highly
probable that the training set will be separable while eliminating all the way down to a few hundred
features. Accordingly, as also discussed in [1], when the data is initially separable, if the data is
high-d (e.g. with thousands of features), "backward elimination methods" - such as all methods
we are evaluating herein - may be able to eliminate all the way down to relatively few features (e.g.
tens; or few hundred) without losing separability, whereas for low-to-intermediate dimensioned
data separability may be initially achieved but lost relatively sooner, e.g. when half of the original
features - rather than relatively few - are still left to eliminate. By contrast, as a third category,
low-d data (e.g. with 20-30 features) may often start (and remain) nonseparable. In this third
case, not only is it not possible to evaluate approaches such as LO that require separability but
also in our work herein we instead focus on high- or intermediate-dimensioned data since they are
the two cases where feature selection is most urgently needed; accordingly, herein we have broken up our
experimental comparisons into the first two categories above.
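
The separability argument can be made quantitative with Cover's counting function. The Python sketch below computes the fraction of the $2^N$ labelings of N points in general position that a hyperplane with an intercept can realize; the function name is hypothetical, and treating each labeling as equally likely is the assumption under which this fraction is a probability.

```python
from fractions import Fraction
from math import comb

def separable_fraction(n_samples, n_features):
    """Cover's count C(N, d) = 2 * sum_{k<d} binom(N-1, k), divided by 2^N.

    d = n_features + 1 counts the free parameters (weights plus intercept).
    """
    d = n_features + 1
    count = 2 * sum(comb(n_samples - 1, k) for k in range(d))
    return Fraction(count, 2 ** n_samples)

p_gene = separable_fraction(22, 7129)   # many more features than samples
p_half = separable_fraction(10, 4)      # samples = 2 * (features + 1)
p_low = separable_fraction(200, 20)     # many more samples than features
```

With far more features than samples every labeling is separable, the fraction drops to one half when the sample count is twice the number of free parameters, and it becomes negligible in the low-d regime, which mirrors the three data categories described above.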

High-Dimensional Datasets: For this category, defined above, we used the same datasets as
previously used by [1], having thousands of features and only tens of samples (i.e. the gene datasets
Duke Breast Cancer, Leukemia, Colon Cancer, obtained from the LIBSVM website [5] with 7129,
7129, 2000 features, respectively, and 38, 62, and 44 samples respectively) and a fourth dataset
(UCI Arcene) with a somewhat high number of features (300 features). For the very high-d case (i.e.
the three gene datasets), average test set classification error rate (i.e. an average across multiple
trials) as a function of the number of retained features (which is reduced going from right to left)
is shown in Figures 1, 3 and 4 for two different nonlinear kernels (Gaussian and polynomial).
Notice from these six plots that the methods have formed four well-separated clusters, i.e. tiers, as
follows. The lowest two curves shown - MFE-LOemb and our proposed method BMFE-QPemb -
form the best-performing tier (i.e. the smallest classification error rate). The next best tier - tier
2 - is formed by our other proposed method MFE-QPemb. Next, methods that perform stepwise
full SVM retraining (i.e. methods with "FRsub" in their name) have clustered to precisely form
tier 3. Lastly, 'basic MFE' and MFE-Slack - which are the only two methods that do not perform
any type of classifier retraining - have formed tier 4 where we see the poorest performance.

Several comments should be made at this point. First, as explained in an earlier Section above,
the results we present herein for "full (SVM) retraining (FR)" and LO are more proper results than
the results presented for these two classifier retraining approaches earlier in [1]. Keeping this in
mind, notice that 'basic MFE' and MFE-Slack, i.e. non-retraining methods, which had achieved the
best performance in [1], are significantly outperformed here, forming the poorest tier - tier 4 - falling
behind their FR-based variants (tier 3) as well as LO (tier 1). Second, let us now consider the two
top performers - MFE-LOemb and our proposed method BMFE-QPemb - that are forming tier 1.



^^In these Figures (1, 4, 3), the number of trials averaged, respectively, are 5, 10, 10 for the Gaussian
kernel and 4, 3, 8 for the polynomial kernel; they are the initially separable trials we generated.



^^Recall that LO is a type of (light) classifier retraining.

Figure 3: Average test set classification error rate for the high-d Leukemia gene dataset that has
7129 initial features.

Figure 4: Average test set classification error rate for the high-d Duke Breast Cancer gene dataset
that has 7129 initial features.

Figure 5: UCI Arcene dataset.



We begin by recalling that our experiments herein for method evaluations are actually somewhat 
favorably biased towards MFE-LOemb rather than our proposed BMFE-QPemb because, as we 
described earlier, while performing hyperparameter tuning during elimination may benefit BMFE- 
QPemb, herein we did not do this; in future work we may, in order to explore whether BMFE-QPemb 
improves upon such tuning, whereas, by contrast, MFE-LOemb performance is not going to change 
since such tuning is not applicable to the LO approach by definition. A second reason to advocate 
our proposed approach BMFE-QPemb is that separability - i.e. strictly satisfying the margin - 
which is required by MFE-LOemb (and not required by our proposed method BMFE-QPemb), may 
lead to overfitting when training samples at the margin are outliers or even mislabeled samples, 
as mentioned earlier. To evaluate this, in future work we may run experiments upon artificially 
mislabeling some samples. As a third supportive point for our proposed method BMFE-QPemb, 
note that it would be worthwhile to explore in future work a hybrid of these two best-performing 
methods; e.g. BMFE-QPemb can be used at elimination steps at which data separability has been 
lost, with MFE-LOemb used at the other (separable) steps. 

Our final dataset for the high-d category, UCI Arcene, has a somewhat high number of features:
only 300, not thousands, taken from the pool of available features for this dataset. Here the result,
as shown in Fig. 5, is that tier 3 - i.e. the tier of methods that perform full SVM retraining - as
a whole has shifted downward and become the best-performing tier, outperforming the LO and QP
approaches that, by contrast, only employ light classifier retraining. This is not surprising since
feature elimination can generally benefit from (aggressive) full SVM retraining when the number
of features is not too high, such as the 300 seen here (since, as we mentioned
earlier, overfitting, being a cumulative effect, depends on how many elimination steps
accumulate).

Low-to-Intermediate Dimensioned Datasets: Defined above, this experiment category is
for datasets which start the feature elimination process separable but, unlike the high-d category,
do not maintain that separability all the way down to relatively few features; e.g. datasets that lose
separability after only half of their features are eliminated, such as the Splice Scale dataset whose
results we show in Fig. 6. Notice in this Figure that our proposed method BMFE-QPemb remains
a best performer whereas MFE-LOemb does not, because MFE-LOemb, unlike in the high-d case,
is no longer usable (applicable) for the vast majority of the elimination steps (due to separability
being lost after the elimination of half of the features).

The full set of results we presented herein - for five different datasets and two different nonlinear
kernels - shows that our proposed methods achieve their best performance when data dimensionality
is high, in the thousands. In fact, when data is high-d, notice that our proposed method BMFE-QPemb
is the best-performing method among all methods for the majority of experiments (i.e. for
4 out of 6 distinct (dataset, kernel) pairings presented in Figures 1, 3 and 4).

5 Conclusions 

For SVM-based feature selection, we introduced two novel, fast and computationally low-cost
"backward feature elimination" approaches theoretically motivated by Statistical Learning Theory
and margin maximization central to SVM learning, and showed that these approaches enable
achieving higher or competitive generalization accuracy - lower or similar test set classification
error rate - than previously proposed SVM-based MFE (Margin-Maximizing Feature Elimination)
methods. Our first proposed approach, named "QP", performs lightweight SVM retraining - equivalent
to training an SVM for 1-dimensional data (which has very little computational cost) - that
adjusts the current solution in the reduced feature space (i.e. upon candidate feature elimination)
to aim to obtain a lower value for the SVM objective function in the reduced space. This approach

^"^The number of trials averaged was 22 for the Gaussian kernel and 24 for the polynomial kernel - they 
are the initially separable trials among the 30 trials we generated. 

^^The number of trials averaged was 14 for the Gaussian kernel and 13 for the polynomial kernel - they 
are the initially separable trials among the 30 trials we generated. We obtained the Splice Scale dataset from 
the LIBSVM [5] website.

Figure 6: Average test set classification error rate for the low-to-intermediate dimensioned Splice
Scale dataset.



is consistent with theory; i.e., the function we are minimizing - to slightly alter the decision boundary -
is essentially the same one minimized by the standard SVM formulation. QP differs from
the previously proposed margin-maximizing (i.e. MFE-based) approach LO (Little Optimization) 
due to being usable even when the data has become nonseparable during the feature elimination 
process; unlike LO, we treat both the SVM slackness parameters and the SVM hyperplane intercept 
as parameters to be optimized, conveniently via a very low-cost optimization formulation. QP also 
differs from another previously proposed MFE-based approach, MFE-Slack, due to allowing the 
decision boundary to be altered - lightly, not aggressively - during the feature elimination process. 
To embed this novel QP approach into the feature elimination decision - consistent with margin 
maximization - we proposed a feature elimination method named MFE-QPemb. We found that 
the generalization accuracy achieved by this method is higher than MFE-Slack's but lower than 
that of the previously proposed LO-based MFE method (named MFE-LOemb herein). 

Our second proposed approach BMFE (Bound-Based Margin-Maximizing Feature Elimination) 
is motivated by a well-known theoretical upper bound in Statistical Learning Theory - by Vapnik 
et al. - on the "expected risk" of making a classification error. Combining BMFE with the above 
approach of embedding QP into the elimination decision, while staying consistent with margin 
maximization central to SVM learning, we proposed the feature elimination method BMFE-QPemb, 
where the computation cost is, again, like our first method above, little, i.e. essentially again
equivalent to 1d SVM training. We found that this BMFE-QPemb method brought an improvement
in generalization accuracy over our MFE-QPemb method proposed above, consistently achieving
generalization accuracy that is higher than or similar to MFE-LOemb's, for high-dimensional data,
for which feature selection is especially urgently needed.

In summary, for high-dimensional data, which is becoming more widespread in practice, we 
introduced a feature selection method named BMFE-QPemb - whose premise, consistent with 
margin maximization, is both light retraining of the classifier and consistency with a well-known 
bound on "expected risk" (of making a classification error) in Statistical Learning Theory - and 
showed that our proposed method achieves higher or competitive generalization accuracy - lower or 
similar test set error rate - than earlier MFE-based feature selection methods and RFE. We found 
this to be true even when these earlier methods stepwise employ the computationally more costly 
approach of full SVM retraining (i.e. stepwise recalculation of all SVM weights from scratch). In 
particular, we found that our proposed BMFE-QPemb method and the previously proposed MFE- 
LOemb method are the two best performers with respect to generalization accuracy, and noted 
some limitations of the previous method MFE-LOemb, such as i) not being usable whenever the
data becomes nonseparable during the feature elimination process, ii) potential overfitting in the
event that the training samples at the margin are outliers or even mislabeled samples, and iii)
not incorporating a means of hyperparameter tuning into the elimination criterion.

6 Appendix: A fast algorithm for training a 1d SVM

Rather than solving the 1d SVM in the usual way an SVM is solved (by using a "dual" solver),
an SVM training algorithm specifically designed for significantly reduced computation for the 1d
case can be utilized. The algorithm, which is a simple one, is based on two key SVM properties
unique to 1d data:

P1) In each class, there is only one (and exactly one) "margin-setter", the training sample $z_j$
located at margin distance to the boundary, i.e. the sample whose discriminant function value $g_j$
(i.e. $y_j(w z_j + b)$) is equal to 1. It can be easily seen from this equality that $w$ is $2/r$ where $r$

^^Much of the algorithm has been previously documented in [19], a brief 4-page technical report document
that we found online at the website of one of the authors.

^^To see why this holds, consider the following SVM details [3, 12]. In 1d, each support vector in class 1
has the same coordinate (and a similar result holds for the other class) because in 1d there is only one scalar
$z_j$ that satisfies the above margin-setter equation $y_j(w z_j + b) = 1$. If we suppose there are multiple support
vectors (which, by the above requirement, would be at the same coordinate), an identical solution can be
produced by setting the Lagrange multiplier of one to a (positive) number and the rest to 0 [19].



is simply the (scalar) subtraction of one class's margin-setter from the other's; and subsequently
$b$ can be computed as $y_j - w z_j$ using either margin-setter $z_j$ (for which $y_j(w z_j + b) = 1$). Thus, given a candidate
margin-setter pair (where each such pair consists of one margin-setter per class), computing the
associated $w$ and $b$ only takes two scalar additions and multiplications.

P2) The number of "margin-violators" (i.e. training samples with nonzero slackness) is equal
for the two classes.

Accordingly, the algorithm to train the 1d SVM, with four steps, is:

Step 1) Consider these two distinct scenarios: S1) assume class 1 lies to the right of class 2; S2)
vice versa. For each scenario, run Steps 2 and 3.

Step 2) Determine the scenario's set of candidate margin-setter pairs, where each pair consists
of one margin-setter sample per class; i.e., among all pairs of margin-setters (with one setter per
class) that are suitable for P1, determine those that additionally meet P2.

Step 3) Specific to this particular scenario (S1 or S2), choose the candidate that minimizes the
SVM objective function $w^2/2 + C\sum_n \xi_n$.

Step 4) Among the two winning candidates - one per scenario - determined above, choose as
final the one that produced the smaller objective function value.

To minimize computation for the task of determining all candidate pairs for Step 2, one can
first determine the one whose members (the two per-class margin-setters) lie closest to each other,
because, given a pair, a guaranteed next one is simply the next two samples - one sample per class
- that are located in the direction towards the opposite class. Thus, one would first sort within-class
samples in that direction, as a "pre-processing" step for the algorithm. By nature of the
specific sorting order, the two samples that make up a candidate pair - denoted here as sample
$p$ for class 1 and sample $q$ for class 2 - share the same post-sorting (within-class) index, which
thus also serves as pair index. Given a pair index $j$, the set of margin-violators, which is needed
by the sum in Step 3, is readily identified by indexes $k > j$. This sum can thus be written as

$$\sum_n \xi_n = \sum_{k>j} \left(1 - (p_k w + b)\right) + \sum_{k>j} \left(1 + (q_k w + b)\right) = 2(N - j) - w\,(d_j^+ - d_j^-),$$

where $d_j^+ = \sum_{k>j} p_k$ and $d_j^- = \sum_{k>j} q_k$. Since $d_j^+ = d_{j+1}^+ + p_{j+1}$ and $d_j^- = d_{j+1}^- + q_{j+1}$, which are
recursive equations, the sum of Step 3 can be computed recursively (to reduce computation).
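
The four steps and the running-sum shortcut above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `train_1d_svm` and `hinge_objective` are hypothetical, labels are assumed to be +1/-1, the pair index here counts the violators per class starting from the side facing the opposite class (so the sums are accumulated forward rather than through the backward recursion written above), and P1 and P2 are taken as given, so only margin-setter pairs are searched.

```python
def hinge_objective(w, b, z, y, C):
    """Direct evaluation of w^2/2 + C * sum of slacks, used only as a cross-check."""
    return 0.5 * w * w + C * sum(max(0.0, 1.0 - yi * (w * zi + b)) for zi, yi in zip(z, y))

def train_1d_svm(z, y, C):
    """Search margin-setter pairs (Steps 1-4); returns (w, b, objective) or None."""
    best = None
    for s in (+1, -1):  # Step 1: s=+1 assumes class +1 lies to the right, s=-1 the reverse
        # Flip the axis for s=-1 so that class +1 is always "on the right" in this scenario.
        # Both lists are sorted so that index 0 is the sample nearest the opposite class.
        pos = sorted(s * zi for zi, yi in zip(z, y) if yi == +1)
        neg = sorted((s * zi for zi, yi in zip(z, y) if yi == -1), reverse=True)
        d_pos = d_neg = 0.0  # running sums of the violators' coordinates
        for v in range(min(len(pos), len(neg))):  # v = number of violators per class (P2)
            p, q = pos[v], neg[v]                 # Step 2: candidate margin-setter pair
            if p > q:                             # pair is consistent with the scenario
                w = 2.0 / (p - q)                 # from y_j (w z_j + b) = 1 at both setters (P1)
                b = 1.0 - w * p
                slack = 2 * v - w * (d_pos - d_neg)  # closed form; the b terms cancel
                obj = 0.5 * w * w + C * slack        # Step 3
                if best is None or obj < best[2]:    # Step 4 folded into the same loop
                    best = (s * w, b, obj)
            d_pos += p   # the current setters become violators for the next pair
            d_neg += q
    return best

# Separable toy data: the widest-margin pair (3 and 1) should win.
sep = train_1d_svm([3.0, 4.0, 0.0, 1.0], [+1, +1, -1, -1], C=10.0)

# Overlapping toy data: the closed-form slack sum should match the direct hinge sum.
z_ov, y_ov = [1.0, 4.0, 5.0, 0.0, 2.0, -1.0], [+1, +1, +1, -1, -1, -1]
ov = train_1d_svm(z_ov, y_ov, C=1.0)
```

After sorting, each candidate costs a constant number of scalar operations, which is the point of the running sums. The sketch returns the best candidate among margin-setter pairs only; it makes no claim about cases in which P1 or P2 fails to hold.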
